[TLE][MTHREADS] Support TLE Structure on mthreads backend #617
Merged
sunnycase merged 11 commits on Jun 3, 2026
sunnycase (Collaborator) reviewed on May 29, 2026
Thanks for the contribution and for adding TLE structure support for the mthreads backend.
Could you please update the PR description or add supporting documentation to explain which TLE primitives are implemented/supported by this work, and include performance benefit data so reviewers can evaluate whether the implementation scope matches the expected value?
It would be helpful to include:
- The list of implemented TLE primitives, their semantic coverage, and any partial support or known limitations.
- The lowering/runtime path for each key primitive, especially where it differs from the native Triton path.
- Performance data: benchmark cases, input sizes, hardware/driver environment, baseline, before/after results, improvement ratio, and any regression cases.
- If this PR is currently only structural enablement and has no measurable performance gain yet, please state that explicitly and describe the follow-up validation plan.
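As a rough illustration of the before/after comparison requested above, here is a minimal sketch of how an improvement ratio and a regression flag could be computed per benchmark case. The `BenchCase` type, the function names, and the timings are hypothetical placeholders, not data or code from this PR.

```python
from dataclasses import dataclass


@dataclass
class BenchCase:
    name: str           # benchmark case label (hypothetical)
    baseline_ms: float  # time of the baseline path
    tle_ms: float       # time with the TLE path enabled


def speedup(case: BenchCase) -> float:
    """Improvement ratio: baseline time divided by new time (>1 means faster)."""
    return case.baseline_ms / case.tle_ms


def report(cases, regression_threshold=1.0):
    """Return (name, ratio, is_regression) rows, one per case."""
    rows = []
    for case in cases:
        ratio = speedup(case)
        rows.append((case.name, round(ratio, 2), ratio < regression_threshold))
    return rows


# Placeholder timings for illustration only; not measured data.
example = [
    BenchCase("case_a", baseline_ms=2.0, tle_ms=1.0),
    BenchCase("case_b", baseline_ms=1.0, tle_ms=1.25),
]
for row in report(example):
    print(row)
```

Reporting the ratio alongside an explicit regression flag keeps slower cases visible instead of letting an average hide them.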
MTHREADS backend support for the main TLE Structure primitives in this patch:
- tle.gpu.memory_space(x, "shared_memory"): ttg.local_alloc + ttg.local_load. "shared_memory" is supported on mthreads; "tensor_memory" and other spaces are rejected.
- tle.gpu.alloc: ttg.local_alloc. nv_mma_shared_layout=True/default is not supported on mthreads. tmem allocation is not supported.
- tle.gpu.local_ptr
- tle.gpu.copy
- tle.buffered_tensor

Lowering path
Key differences from native Triton:
- tle.gpu.memory_space(..., "shared_memory") is consumed early; no tt.memory_space marker remains after lowering.
- tle.gpu.local_ptr introduces musa_tle.local_pointers, which is later optimized or lowered away before LLVM IR.
- tle.gpu.copy uses ttg.tma_copy as an intermediate but lowers to mthreads/MUSA TME ops such as ttmg.async_tme_copy_global_to_local, ttmg.async_tme_copy_local_to_global, and LLVM MUSA TME intrinsics, instead of native Triton TME lowering.
- tle.gpu.copy lowers through load/store plus local pointer paths, with mthreads-specific async-store optimization.

Performance Data
Benchmark source:
- python/tutorials/tle/01-fft.py
- python/tutorials/tle/03-topk.py

Note:
For MTHREADS testing, the tutorial currently requires manually replacing is_cuda with is_musa before running.

Environment:
Baselines and results on large-shape cases:
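Regarding the note above on running the tutorials on MTHREADS: the manual is_cuda to is_musa replacement could be scripted along the following lines. This is a minimal sketch that treats the tutorial as plain text; `patch_backend_check` and the sample string are hypothetical and do not come from this PR or from the tutorials.

```python
import re


def patch_backend_check(source: str) -> str:
    """Replace whole-word `is_cuda` with `is_musa` in tutorial source text."""
    return re.sub(r"\bis_cuda\b", "is_musa", source)


# Hypothetical snippet standing in for a tutorial file's contents.
sample = "if is_cuda():\n    run()\nflag = not_is_cuda_like\n"
patched = patch_backend_check(sample)
print(patched)
```

The word-boundary match leaves longer identifiers that merely contain is_cuda untouched. To apply it to a real file, one option is to read the text with pathlib.Path.read_text and write the result to a separate copy rather than editing the tutorial in place.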